Data used in this work is House Sale Prices in Ames, Iowa, USA from 2006-2010, downloaded from Kaggle : https://www.kaggle.com/c/house-prices-advanced-regression-techniques (train.csv) .
library(tidyverse)
df_ori=read.csv('train.csv')
df_ori=tbl_df(df_ori)
df_ori
Data structures :
str(df_ori)
## tibble [1,460 x 81] (S3: tbl_df/tbl/data.frame)
## $ Id : int [1:1460] 1 2 3 4 5 6 7 8 9 10 ...
## $ MSSubClass : int [1:1460] 60 20 60 70 60 50 20 60 50 190 ...
## $ MSZoning : chr [1:1460] "RL" "RL" "RL" "RL" ...
## $ LotFrontage : int [1:1460] 65 80 68 60 84 85 75 NA 51 50 ...
## $ LotArea : int [1:1460] 8450 9600 11250 9550 14260 14115 10084 10382 6120 7420 ...
## $ Street : chr [1:1460] "Pave" "Pave" "Pave" "Pave" ...
## $ Alley : chr [1:1460] NA NA NA NA ...
## $ LotShape : chr [1:1460] "Reg" "Reg" "IR1" "IR1" ...
## $ LandContour : chr [1:1460] "Lvl" "Lvl" "Lvl" "Lvl" ...
## $ Utilities : chr [1:1460] "AllPub" "AllPub" "AllPub" "AllPub" ...
## $ LotConfig : chr [1:1460] "Inside" "FR2" "Inside" "Corner" ...
## $ LandSlope : chr [1:1460] "Gtl" "Gtl" "Gtl" "Gtl" ...
## $ Neighborhood : chr [1:1460] "CollgCr" "Veenker" "CollgCr" "Crawfor" ...
## $ Condition1 : chr [1:1460] "Norm" "Feedr" "Norm" "Norm" ...
## $ Condition2 : chr [1:1460] "Norm" "Norm" "Norm" "Norm" ...
## $ BldgType : chr [1:1460] "1Fam" "1Fam" "1Fam" "1Fam" ...
## $ HouseStyle : chr [1:1460] "2Story" "1Story" "2Story" "2Story" ...
## $ OverallQual : int [1:1460] 7 6 7 7 8 5 8 7 7 5 ...
## $ OverallCond : int [1:1460] 5 8 5 5 5 5 5 6 5 6 ...
## $ YearBuilt : int [1:1460] 2003 1976 2001 1915 2000 1993 2004 1973 1931 1939 ...
## $ YearRemodAdd : int [1:1460] 2003 1976 2002 1970 2000 1995 2005 1973 1950 1950 ...
## $ RoofStyle : chr [1:1460] "Gable" "Gable" "Gable" "Gable" ...
## $ RoofMatl : chr [1:1460] "CompShg" "CompShg" "CompShg" "CompShg" ...
## $ Exterior1st : chr [1:1460] "VinylSd" "MetalSd" "VinylSd" "Wd Sdng" ...
## $ Exterior2nd : chr [1:1460] "VinylSd" "MetalSd" "VinylSd" "Wd Shng" ...
## $ MasVnrType : chr [1:1460] "BrkFace" "None" "BrkFace" "None" ...
## $ MasVnrArea : int [1:1460] 196 0 162 0 350 0 186 240 0 0 ...
## $ ExterQual : chr [1:1460] "Gd" "TA" "Gd" "TA" ...
## $ ExterCond : chr [1:1460] "TA" "TA" "TA" "TA" ...
## $ Foundation : chr [1:1460] "PConc" "CBlock" "PConc" "BrkTil" ...
## $ BsmtQual : chr [1:1460] "Gd" "Gd" "Gd" "TA" ...
## $ BsmtCond : chr [1:1460] "TA" "TA" "TA" "Gd" ...
## $ BsmtExposure : chr [1:1460] "No" "Gd" "Mn" "No" ...
## $ BsmtFinType1 : chr [1:1460] "GLQ" "ALQ" "GLQ" "ALQ" ...
## $ BsmtFinSF1 : int [1:1460] 706 978 486 216 655 732 1369 859 0 851 ...
## $ BsmtFinType2 : chr [1:1460] "Unf" "Unf" "Unf" "Unf" ...
## $ BsmtFinSF2 : int [1:1460] 0 0 0 0 0 0 0 32 0 0 ...
## $ BsmtUnfSF : int [1:1460] 150 284 434 540 490 64 317 216 952 140 ...
## $ TotalBsmtSF : int [1:1460] 856 1262 920 756 1145 796 1686 1107 952 991 ...
## $ Heating : chr [1:1460] "GasA" "GasA" "GasA" "GasA" ...
## $ HeatingQC : chr [1:1460] "Ex" "Ex" "Ex" "Gd" ...
## $ CentralAir : chr [1:1460] "Y" "Y" "Y" "Y" ...
## $ Electrical : chr [1:1460] "SBrkr" "SBrkr" "SBrkr" "SBrkr" ...
## $ X1stFlrSF : int [1:1460] 856 1262 920 961 1145 796 1694 1107 1022 1077 ...
## $ X2ndFlrSF : int [1:1460] 854 0 866 756 1053 566 0 983 752 0 ...
## $ LowQualFinSF : int [1:1460] 0 0 0 0 0 0 0 0 0 0 ...
## $ GrLivArea : int [1:1460] 1710 1262 1786 1717 2198 1362 1694 2090 1774 1077 ...
## $ BsmtFullBath : int [1:1460] 1 0 1 1 1 1 1 1 0 1 ...
## $ BsmtHalfBath : int [1:1460] 0 1 0 0 0 0 0 0 0 0 ...
## $ FullBath : int [1:1460] 2 2 2 1 2 1 2 2 2 1 ...
## $ HalfBath : int [1:1460] 1 0 1 0 1 1 0 1 0 0 ...
## $ BedroomAbvGr : int [1:1460] 3 3 3 3 4 1 3 3 2 2 ...
## $ KitchenAbvGr : int [1:1460] 1 1 1 1 1 1 1 1 2 2 ...
## $ KitchenQual : chr [1:1460] "Gd" "TA" "Gd" "Gd" ...
## $ TotRmsAbvGrd : int [1:1460] 8 6 6 7 9 5 7 7 8 5 ...
## $ Functional : chr [1:1460] "Typ" "Typ" "Typ" "Typ" ...
## $ Fireplaces : int [1:1460] 0 1 1 1 1 0 1 2 2 2 ...
## $ FireplaceQu : chr [1:1460] NA "TA" "TA" "Gd" ...
## $ GarageType : chr [1:1460] "Attchd" "Attchd" "Attchd" "Detchd" ...
## $ GarageYrBlt : int [1:1460] 2003 1976 2001 1998 2000 1993 2004 1973 1931 1939 ...
## $ GarageFinish : chr [1:1460] "RFn" "RFn" "RFn" "Unf" ...
## $ GarageCars : int [1:1460] 2 2 2 3 3 2 2 2 2 1 ...
## $ GarageArea : int [1:1460] 548 460 608 642 836 480 636 484 468 205 ...
## $ GarageQual : chr [1:1460] "TA" "TA" "TA" "TA" ...
## $ GarageCond : chr [1:1460] "TA" "TA" "TA" "TA" ...
## $ PavedDrive : chr [1:1460] "Y" "Y" "Y" "Y" ...
## $ WoodDeckSF : int [1:1460] 0 298 0 0 192 40 255 235 90 0 ...
## $ OpenPorchSF : int [1:1460] 61 0 42 35 84 30 57 204 0 4 ...
## $ EnclosedPorch: int [1:1460] 0 0 0 272 0 0 0 228 205 0 ...
## $ X3SsnPorch : int [1:1460] 0 0 0 0 0 320 0 0 0 0 ...
## $ ScreenPorch : int [1:1460] 0 0 0 0 0 0 0 0 0 0 ...
## $ PoolArea : int [1:1460] 0 0 0 0 0 0 0 0 0 0 ...
## $ PoolQC : chr [1:1460] NA NA NA NA ...
## $ Fence : chr [1:1460] NA NA NA NA ...
## $ MiscFeature : chr [1:1460] NA NA NA NA ...
## $ MiscVal : int [1:1460] 0 0 0 0 0 700 0 350 0 0 ...
## $ MoSold : int [1:1460] 2 5 9 2 12 10 8 11 4 1 ...
## $ YrSold : int [1:1460] 2008 2007 2008 2006 2008 2009 2007 2009 2008 2008 ...
## $ SaleType : chr [1:1460] "WD" "WD" "WD" "WD" ...
## $ SaleCondition: chr [1:1460] "Normal" "Normal" "Normal" "Abnorml" ...
## $ SalePrice : int [1:1460] 208500 181500 223500 140000 250000 143000 307000 200000 129900 118000 ...
df_ori %>% is.na() %>% colSums()
## Id MSSubClass MSZoning LotFrontage LotArea
## 0 0 0 259 0
## Street Alley LotShape LandContour Utilities
## 0 1369 0 0 0
## LotConfig LandSlope Neighborhood Condition1 Condition2
## 0 0 0 0 0
## BldgType HouseStyle OverallQual OverallCond YearBuilt
## 0 0 0 0 0
## YearRemodAdd RoofStyle RoofMatl Exterior1st Exterior2nd
## 0 0 0 0 0
## MasVnrType MasVnrArea ExterQual ExterCond Foundation
## 8 8 0 0 0
## BsmtQual BsmtCond BsmtExposure BsmtFinType1 BsmtFinSF1
## 37 37 38 37 0
## BsmtFinType2 BsmtFinSF2 BsmtUnfSF TotalBsmtSF Heating
## 38 0 0 0 0
## HeatingQC CentralAir Electrical X1stFlrSF X2ndFlrSF
## 0 0 1 0 0
## LowQualFinSF GrLivArea BsmtFullBath BsmtHalfBath FullBath
## 0 0 0 0 0
## HalfBath BedroomAbvGr KitchenAbvGr KitchenQual TotRmsAbvGrd
## 0 0 0 0 0
## Functional Fireplaces FireplaceQu GarageType GarageYrBlt
## 0 0 690 81 81
## GarageFinish GarageCars GarageArea GarageQual GarageCond
## 81 0 0 81 81
## PavedDrive WoodDeckSF OpenPorchSF EnclosedPorch X3SsnPorch
## 0 0 0 0 0
## ScreenPorch PoolArea PoolQC Fence MiscFeature
## 0 0 1453 1179 1406
## MiscVal MoSold YrSold SaleType SaleCondition
## 0 0 0 0 0
## SalePrice
## 0
Let’s exclude Alley, FireplaceQu, PoolQC, Fence, MiscFeature since they have many missing values. Id column also need to be removed because it won’t give us any useful informations.
df_ori=df_ori %>% select(-c(Id, Alley, FireplaceQu, PoolQC, Fence, MiscFeature))
SalePrice is the target variable of this dataset and since there is no NA values in this column thus it can be directly plotted or analyzed.
library(patchwork)
p1=df_ori %>%
ggplot(aes(x=SalePrice))+geom_histogram()+theme_bw()
p2=df_ori %>%
ggplot(aes(x=SalePrice))+geom_boxplot()+theme_bw()+theme(axis.text.y = element_blank(),axis.ticks.y=element_blank())
p1+p2+plot_annotation(title='The Histogram and Boxplot of SalePrice Variable')
summary(df_ori$SalePrice)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 34900 129975 163000 180921 214000 755000
From the statistic summary of SalePrice, 50% of SalePrices from 2006-2010 in Ames, Iowa ranged from 129975-214000 $ meanwhile the maximum value is very far from that range, i.e 755000, indicating that the distribution of SalePrices is likely to be positively skewed (proven to be true by the histogram and the boxplot). Moreover, these results are quite intuitive, such as many outliers detected in the right tail side of the boxplot indicating that there are exceptional houses valued so expensively and much higher than Ames’ standard house prices. From the author point of view, this phenomenon is common for most cities/states in the world since each city tends to have its own elite neighborhood, for example Beverly Hills in California. Other words to explain this phenomenon is there are no limitations for how luxurious or expensive a house can be but of course there are standards for a building can still be stated as house.
The OverallQuall is the ordinal variable which determines the rating about the overall material and finish of the house , scaled from 1 (very poor) - 10 (very excellent).
p1=df_ori %>% ggplot(aes(y=SalePrice,fill=OverallQual))+geom_boxplot()+theme_bw()+theme(axis.text.x = element_blank(),axis.ticks.x = element_blank())+facet_grid(~OverallQual)+labs(title = 'SalePrice vs OverallQuall',x='Overall Quality')
p2=df_ori %>% ggplot(aes(x=OverallQual,y=SalePrice))+geom_jitter(aes(color=OverallQual),width = 0.25)+scale_x_discrete(limits=c(1:10))+theme_minimal()+labs(title = 'SalePrice vs OverallQuall',x='Overall Quality')
p1/p2
This variable will be analyzed by plotting the relationship between OverallQuall and SalePrice and it gives result that the SalePrice tends to go higher (the median increases) as the Overall Quality increases . This result considered reasonable as the better quality of materials and finish of the house will also offer higher price. From the plot one also can clearly see intuitive result that houses with Excellent+ and Poor- are exceptional and rare , meanwhile houses with Average-Good quality are frequent.
p1=df_ori %>% ggplot(aes(x=OverallQual))+geom_histogram(binwidth = 1,fill='steelblue')+theme_minimal()+labs(title='Distribution of OverallQual',subtitle = 'The distribution of OverallQual nearly fits the normal distribution')
p1
summary(df_ori$OverallQual)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 5.000 6.000 6.099 7.000 10.000
Half of the houses in Ames, Iowa are rated 5 (Average) to 7 (Good).
In question 2, we concluded that one of the strong predictors for SalePrice is YearBuilt which is none other than the original construction date. The plot of YearBuilt vs SalePrice also shows that newer houses tend to get higher prices. However, are we sure of this insight? Or maybe there are actually other reasons that make newer houses get higher prices? To evaluate this question, let’s take a look at the previous Spearman correlation heatmap to check other variables that correlate with YearBuilt :
Let’s evaluate those variables one by one :
p1=df_ori %>% ggplot(aes(y=YearBuilt,fill=OverallQual))+geom_boxplot()+theme_bw()+theme(axis.text.x = element_blank(),axis.ticks.x = element_blank())+facet_grid(~OverallQual)+labs(title = 'YearBuilt vs OverallQuall',x='Overall Quality')
p2=df_ori %>% ggplot(aes(x=OverallQual,y=YearBuilt))+geom_jitter(aes(color=OverallQual),width = 0.25)+geom_smooth(method='lm',color='black')+scale_x_discrete(limits=c(1:10))+theme_minimal()+labs(title = 'YearBuilt vs OverallQuall',x='Overall Quality')
## Warning: Continuous limits supplied to discrete scale.
## Did you mean `limits = factor(...)` or `scale_*_continuous()`?
p3=df_ori %>% ggplot(aes(x=YearBuilt,y=SalePrice))+geom_point(aes(color=OverallQual))+geom_smooth(method='lm',color='black')+theme_minimal()+labs(title='SalePrice vs YearBuilt')
p4=ggplot()+theme_minimal()
(p1+p3)/(p2+p4)
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
p1=df_ori %>% ggplot(aes(y=YearBuilt))+geom_boxplot()+theme_bw()+theme(axis.text.x = element_blank(),axis.ticks.x = element_blank())+facet_grid(~GarageCars)+labs(title = 'YearBuilt vs GarageCars',x='GarageCars')
p2=df_ori %>% ggplot(aes(x=GarageCars,y=YearBuilt))+geom_jitter(aes(color=cut_number(YearBuilt,4)),width = 0.25)+scale_x_discrete(limits=c(0:4))+theme_minimal()+labs(title = 'YearBuilt vs GarageCars',x='YearBuilt Binning',color='YearBuilt Binning (Equal Amount)')
## Warning: Continuous limits supplied to discrete scale.
## Did you mean `limits = factor(...)` or `scale_*_continuous()`?
p3=df_ori %>% mutate(GarageCarsChara=as.character(GarageCars)) %>%
ggplot(aes(x=YearBuilt,y=SalePrice))+geom_point(aes(color=GarageCarsChara))+geom_smooth(method='lm')+theme_minimal()+scale_color_manual(values = c("0" = "purple","1"="orange","2"="steelblue","3"="green","4"="brown")) +labs(title='SalePrice vs YearBuilt',color='GarageCars')
p4=df_ori %>% ggplot(aes(x=GarageArea,y=YearBuilt))+geom_point(aes(color=cut_number(YearBuilt,4)))+theme_minimal()+labs(color='YearBuilt Binning (Equal Amount)',title='YearBuilt vs GarageArea')
(p1+p3)/(p2+p4)
## `geom_smooth()` using formula 'y ~ x'
p1=df_ori %>% mutate(FullBathChara=as.character(FullBath)) %>%
ggplot(aes(x=YearBuilt,y=SalePrice))+geom_point(aes(color=FullBathChara))+geom_smooth(method='lm')+theme_minimal()+scale_color_manual(values = c("0" = "purple","1"="orange","2"="steelblue","3"="green")) +labs(title='SalePrice vs YearBuilt',color='FullBath')
p2=df_ori %>%
ggplot(aes(y=YearBuilt))+geom_boxplot()+facet_grid(~FullBath)+theme_bw()+theme(axis.text.x = element_blank(),axis.ticks.x = element_blank())+labs(title='YearBuilt vs FullBath')
p3=df_ori %>% ggplot(aes(x=FullBath,y=YearBuilt))+geom_jitter(aes(color=cut_number(YearBuilt,3)))+theme_minimal()+labs(title='YearBuilt vs FullBath',color='YearBuilt Binning (Equal Amount)')
p4=df_ori %>% ggplot()+theme_minimal()
(p2+p1)/(p3+p4)
## `geom_smooth()` using formula 'y ~ x'
library(gghighlight)
p1=df_ori %>% ggplot(aes(x=GrLivArea,y=SalePrice))+geom_point(color='red',shape=20,size=3)+theme_minimal()+labs(title='SalePrice vs GrLivArea')+gghighlight(GrLivArea > 4500,unhighlighted_params = aes(fill = NULL, alpha = 0.5,shape=18))+labs(subtitle = 'There are 2 houses that do have large GrLivArea but priced unnaturally low')+geom_text(aes(y=SalePrice*1.2,label='haunted?'))
p1
One can easily see that houses in dataset commonly have GrLivArea ranged from 1000-2500 sqft. Houses with GrLivArea > 4000 are exceptional and very rare to be found. However, there are 2 houses that will become our observation here (I highlight these houses with red color) because they have unnatural low price for their large area. They are easily recognized as the largest above grade living area in this dataset (>4500 sqft), but how come their price are below 200000 $ given that GrLivArea is strongly correlated with the SalePrice?
df_ori %>% filter(GrLivArea>4500)
From SaleType and SaleCondition, we accept informations such as the houses were actually just constructed and sold at the time (2007 and 2008) but hadn’t been completed yet when last assessed. Since they are new and just constructed, i do not think these houses give such frightening vibes hence people assumed they are haunted. Given they also have huge GrLivArea and 10/10 OverallQual , other important numerical variables such as GarageCar, FullBath, TotalBsmtSF, etc should also have good value. Other than numerical variables, I think categorical variables must be analyzed too, aren’t they? To do that, I will create a function to plot all categorical variables vs SalePrice distribution simultaneously. I will also highlight the category related to the observed houses to make the plots are easier to read.
library(Rmisc)
library(grid)
Plotme=function(vector_column){
p=list()
for (i in 1:length(vector_column)){
highlight_category=df_ori[df_ori$GrLivArea>4500,][vector_column[i]]
p[[i]]=ggplot(df_ori)+geom_boxplot(aes_string(x=vector_column[i],y='SalePrice'))+theme_minimal()+theme(axis.text.x = element_text(angle=90))+labs(title=paste('SalePrice vs',vector_column[i]),subtitle = paste('Highlight :',highlight_category[1,1],'and',highlight_category[2,1]))
}
multiplot(plotlist=p,cols=4)
}
df_ctg=df_ori %>%
select_if(~!any(is.numeric(.))) #Select categorical data
ctg=colnames(df_ctg) #categorical variables
Plotme(ctg)
From SalePrice vs SaleCondition, one can see that Partial are roughly higher than others hence we can’t assume that the unnatural low prices come from it. SalePrice vs SaleType plot also supports this statement by showing the price of new houses tend to be higher. Unfortunately, I can’t extract much useful informations from other plots that may answer our problem since almost all category variables do not show any clear influences on SalePrice and the observed houses also do not have any bad quality reviews. However, I suspect 1 variable that may can help us answer this question, namely Neighborhood. The Neighborhood of our observed houses is Edwards.
library(ggridges) #library to plot the ridgline plot
df_ori %>%
ggplot(aes(x=SalePrice/1000,y=Neighborhood,fill=Neighborhood))+geom_density_ridges()+theme_minimal()+labs(x='Sale Prices in 1000$',y='Neighborhood')+theme_ridges(font_size = 11,center_axis_labels = TRUE)+theme(legend.position = 'none')+labs(title='Sale Price for Each Neighborhood',subtitle='Edwards seems to have low price distribution compared to others')
Let’s sort the Neighborhood from the lowest SalePrice median.
x=df_ori %>% #Neighborhood with low price
group_by(Neighborhood) %>%
dplyr::summarise(count=n(),medianSalePrice=median(SalePrice),meanQuall=mean(OverallQual)) %>% arrange(medianSalePrice)
che=x[1:10,'Neighborhood'] #save 10 neighborhoods with the lowest SalePrice median
che=che[[1]]
x
After sorting the median of Price Distribution, we can confirm that Edwards is one of the Neighborhoods with the lowest median of sale price distribution (5th), i.e 121750, roughly half the price of the 3 most elite Neighborhood such as StoneBr (278000), NoRidge (301500), and NridgHt (315000).To prove the influences of Neighborhood, let’s compare the SalePrice vs GrLivArea trendline between 10 lowest neighborhood and others neighborhood.
df_ori %>% ggplot(aes(x=GrLivArea,y=SalePrice,color=Neighborhood %in%che))+geom_point()+geom_smooth(method='lm')+theme_minimal()+scale_color_discrete('10 Neighborhood with the lowest price')+labs(title='GrLivArea vs SalePrice',subtitle='10 Neighborhood with the lowest price have smaller trendline slope')
## `geom_smooth()` using formula 'y ~ x'
From above graph, I think we have found one answer to our question which is Neighborhood plays a big role in the measurement of houses sale price or in other words, these houses have unnaturally low prices because they are located in 10 Neighborhood with the lowest price.
But, is it all?
I am not sure if Neighborhood is the one and only answer, but I am not sure either which (category variables other than neighborhood) that also ‘play’ here. However, I do have other suspicions and this might be a controversial take from me. Let’s visualize the SalePrice vs YrSold using jitter plot.
df_ori %>% ggplot(aes(x=YrSold,y=SalePrice))+geom_jitter(color='gray')+geom_smooth(color='black',method='lm')+theme_minimal()+labs(title='Great Recession Era',subtitle='Great Recession starts from December 2007 - June 2009')
## `geom_smooth()` using formula 'y ~ x'
According to https://www.investopedia.com/terms/g/great-recession.asp , The Great Recession was the sharp decline in economic activity during the late 2000s. It is considered the most significant downturn since the Great Depression. The term Great Recession applies to both the U.S. recession, officially lasting from December 2007 to June 2009, and the ensuing global recession in 2009. The economic slump began when the U.S. housing market went from boom to bust, and large amounts of mortgage-backed securities (MBS’s) and derivatives lost significant value. In Ames however, we get interesting result that that the prices were stagnant during 2006-2010. These years was surely a nightmare for all real estate inventors because the prices didn’t raise year after year. Coincidently, our 2 observed houses were sold exactly at the beginning of the recession, i.e October 2007 and January 2008. So, I think right now we understand the situation back at the time.
Other thing that i want to emphasize is the Supply vs Demand concept. Let’s take a look again at the jitter plot of YrSold vs GrLivArea.
p1=df_ori %>% ggplot(aes(x=YrSold,y=GrLivArea))+geom_jitter(color='gray')+geom_smooth(method='lm',color='black')+theme_minimal()+labs(title='YrSold vs GrLivArea')
p2=df_ori %>% ggplot(aes(y=GrLivArea))+geom_boxplot()+theme_minimal()+labs(title='Distribution of GrLivArea')+theme(axis.text.x = element_blank())
p1+p2
## `geom_smooth()` using formula 'y ~ x'
q3=quantile(df_ori$GrLivArea,0.75)
q1=quantile(df_ori$GrLivArea,0.25)
out_lim=q3+1.5*(q3-q1)
n=df_ori %>% filter(GrLivArea<=out_lim) %>% select(YrSold) %>% summarise_all(funs(n()))
print(paste('The minimum GrLivArea is :',min(df_ori$GrLivArea)))
## [1] "The minimum GrLivArea is : 334"
print(paste('The mean of GrLivArea is :',mean(df_ori$GrLivArea)))
## [1] "The mean of GrLivArea is : 1515.46369863014"
print(paste('The median of GrLivArea is :',median(df_ori$GrLivArea)))
## [1] "The median of GrLivArea is : 1464"
print(paste('Percentage of houses exclude outlier is ',as.integer(n[[1]])/1460))
## [1] "Percentage of houses exclude outlier is 0.978767123287671"
print(paste('Q3 + 1.5IQR =',out_lim))
## [1] "Q3 + 1.5IQR = 2747.625"
By simple calculation, we know that 97.87% of houses sold within 2006-2009 have GrLivArea < 2747.625 sqft. In other words, from our sample, we know that in Ames, 97.87% demands during that period are houses with GrLivArea ranged from 334-2748 sqft. So let’s pretend back at the time that you want to sell a house : Do you want to take such a big risk, sell a quite large houses with GrLivArea > 4500 sqft by quite expensive price, located in Edwards, at the beginning of the Great Recession? I think the answer is simply no, isn’t it?
To end this long explanation, I want to engineer a new feature, called Price_Area, which is none other than SalePrice/GrLivArea. Based on my hypotheses, SalePrice/GrLivArea will decrease after reaching a certain point of GrLivArea (house will be less valuable if it is too large), so the slope of Price_Area linear trendline is negative as GrLivArea increases. To get stronger evidence, i will exclude these observed houses from the plot so the trendline won’t be affected by these ‘haunted’ houses.
df_cad=df_ori %>% mutate(Price_Area=SalePrice/GrLivArea)
df_cad %>% filter(GrLivArea < 4500) %>% ggplot(aes(x=GrLivArea,y=Price_Area))+geom_point()+geom_smooth(method='lm')+facet_grid(cols=vars(YrSold))+theme_bw()
## `geom_smooth()` using formula 'y ~ x'
As I expected, the slope will be negative and reach its peak during 2007-2008 (the climax of Great Recession). At 2006 and 2009 the slope is very close to 0 and return to be negative at 2010 (please note that the data was recorded until July 2010).
This is just my opinion, so please correct me if I am wrong. After all, I think these houses absolutely are not haunted, maybe it was just such a right decision to sell them cheaply back at the time.
I got this insight when I was doing the analysis of the previous number. Let’s plot MoSold vs the amount of sales during 2006-2010.
df_ori %>% ggplot(aes(x=MoSold))+geom_bar(stat = 'count')+facet_grid(cols=vars(YrSold))+coord_flip()+theme_bw()+scale_x_discrete(limits=c(1:12))+labs(title='Count of Sales for Each Month during 2006-2010',subtitle = 'The amount of sales always reach its peak at June - July')
## Warning: Continuous limits supplied to discrete scale.
## Did you mean `limits = factor(...)` or `scale_*_continuous()`?
From above graph, one can clearly see that more houses are sold in June-July period. For your information, June usually is the beginning of the summer season and this is where season plays a big role. Summer is always be a busy season for house hunting. It is the season of weddings, holiday and the time when families prepare to send their children to new schools. These are three popular reasons for moving home or buying a new house. There are already many articles about this on the internet. If you are interested, you can read one of them : https://slate.com/culture/2008/09/an-economics-mystery-why-houses-cost-more-in-summer-has-finally-been-solved.html .
How about the prices? Well remember back at the time, The Great Recession happened.
mosold_chara=as.character(c(1:12))
df_ori %>% ggplot(aes(x=as.character(MoSold),y=SalePrice))+geom_boxplot()+scale_x_discrete(limits=mosold_chara)+theme_minimal()+labs(title='Price vs MoSold',subtitle = 'Price tends to be stagnant throughout the month',x='Month',y='SalePrice')